Understanding Consumer Behavior Through Geospatial and Time Data

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import plotly.express as px
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)

DATA CLEANING/WRANGLING

A note on the data

I added the latitude and longitude of each store city from the OpenStreetMap API. The script lives in a separate file (get_coordinates.py) that fetches the coordinates online and merges them with part2.csv. If you need to run it, please change the file paths accordingly.
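get_coordinates.py itself is not included here, so as a rough sketch: a script like it would build a city-to-coordinates lookup (from the API) and left-merge it onto the visits. The `merge_coordinates` helper and the hard-coded `lookup` below are illustrative stand-ins, not the actual script.

```python
# Illustrative sketch only -- the real logic lives in get_coordinates.py.
import pandas as pd

def merge_coordinates(visits: pd.DataFrame, lookup: dict) -> pd.DataFrame:
    """Attach a '(lat, lon)' coordinates string per store_city."""
    coords = pd.Series(lookup, name="coordinates")
    coords.index.name = "store_city"
    return visits.merge(coords.reset_index(), on="store_city", how="left")

# In the real script the lookup would be filled from the API, e.g. one
# request per unique city against https://nominatim.openstreetmap.org/search
lookup = {"Dachau": "(48.2592477, 11.4354419)"}
visits = pd.DataFrame({"store_city": ["Dachau", "Erding"]})
merged = merge_coordinates(visits, lookup)
print(merged)
```

Cities missing from the lookup come through with NaN coordinates, which is why the map cells later sample around a small number of unmatched rows.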

In [2]:
data = pd.read_csv("/home/kla/Documents/tests/data_coordinates.csv", delimiter=",", header=0, error_bad_lines=False, low_memory=True)
In [3]:
data.shape
Out[3]:
(1037495, 14)
In [4]:
data.head(5)
Out[4]:
Unnamed: 0 user_id user_platform_category user_platform_os user_zip store_zip store_city store_id store_federal_state visit_start visit_end sector_name visit_duration_minutes coordinates
0 0 0a175625-cf3f-4a9b-b43b-a39576564c6f phone.native android 85221.0 85221 Dachau 100716641 Bayern 2020-02-06T15:23:52.599 2020-02-06T15:42:52.599 Fast Food 19 (48.2592477, 11.4354419)
1 1 2bc6cccc-1a10-4564-a66f-e52a379e0c07 phone.native android 80997.0 85221 Dachau 100716641 Bayern 2020-01-31T19:32:27.289 2020-01-31T21:27:27.289 Fast Food 115 (48.2592477, 11.4354419)
2 2 3464b2c2-2bc2-4080-81b4-fa003597ca02 phone.native android 85221.0 85221 Dachau 100716641 Bayern 2020-02-02T11:52:31.590 2020-02-02T12:02:31.590 Fast Food 10 (48.2592477, 11.4354419)
3 3 f90f6dd7-37aa-49b1-a53e-861da4bed346 phone.native android 85221.0 85221 Dachau 100716641 Bayern 2020-02-11T20:12:10.969 2020-02-11T20:47:10.969 Fast Food 35 (48.2592477, 11.4354419)
4 4 ce08f498-ea08-4c5e-a1c9-770cdac6763f phone.native android 85221.0 85221 Dachau 100716641 Bayern 2020-03-26T13:05:09.999 2020-03-26T14:14:09.999 Fast Food 69 (48.2592477, 11.4354419)
In [5]:
#separating the coordinates column into latitude and longitude

coords = data["coordinates"].str[1:-1].str.split(',', expand=True).astype(float)
coords.columns = ["lat", "long"]
data = pd.concat([data, coords], axis=1)
In [6]:
#transforming dates to datetime objects once, then deriving date, time, and day-of-week columns
data["visit_start"] = pd.to_datetime(data["visit_start"])
data["visit_end"] = pd.to_datetime(data["visit_end"])
data["visit_start_date"] = data["visit_start"].dt.date
data["visit_start_hour"] = data["visit_start"].dt.time
data["visit_end_date"] = data["visit_end"].dt.date
data["visit_end_hour"] = data["visit_end"].dt.time
data["visit_day"] = data["visit_start"].dt.day_name()
In [7]:
#ordering the days of the week
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
data['visit_day'] = pd.Categorical(data['visit_day'], categories=days, ordered=True)
In [8]:
data.head(3)
Out[8]:
Unnamed: 0 user_id user_platform_category user_platform_os user_zip store_zip store_city store_id store_federal_state visit_start ... sector_name visit_duration_minutes coordinates lat long visit_start_date visit_start_hour visit_end_date visit_end_hour visit_day
0 0 0a175625-cf3f-4a9b-b43b-a39576564c6f phone.native android 85221.0 85221 Dachau 100716641 Bayern 2020-02-06 15:23:52.599 ... Fast Food 19 (48.2592477, 11.4354419) 48.259248 11.435442 2020-02-06 15:23:52.599000 2020-02-06 15:42:52.599000 Thursday
1 1 2bc6cccc-1a10-4564-a66f-e52a379e0c07 phone.native android 80997.0 85221 Dachau 100716641 Bayern 2020-01-31 19:32:27.289 ... Fast Food 115 (48.2592477, 11.4354419) 48.259248 11.435442 2020-01-31 19:32:27.289000 2020-01-31 21:27:27.289000 Friday
2 2 3464b2c2-2bc2-4080-81b4-fa003597ca02 phone.native android 85221.0 85221 Dachau 100716641 Bayern 2020-02-02 11:52:31.590 ... Fast Food 10 (48.2592477, 11.4354419) 48.259248 11.435442 2020-02-02 11:52:31.590000 2020-02-02 12:02:31.590000 Sunday

3 rows × 21 columns

In [9]:
#The file is large and my laptop slow, so I sample only a quarter of the dataset.
#Fixing random_state makes the sample (and the figures below) reproducible.
data = data.sample(frac=.25, random_state=1)

DATA EXPLORATION AND ANALYSIS

In [10]:
data.describe()
Out[10]:
Unnamed: 0 user_zip store_zip store_id visit_duration_minutes lat long
count 2.593740e+05 259165.000000 259374.000000 2.593740e+05 259374.000000 258791.000000 258791.000000
mean 5.193847e+05 53259.001825 53267.402562 2.606071e+08 33.600793 50.642966 11.873574
std 2.998078e+05 36884.422451 36919.632784 3.309191e+08 25.997787 2.725775 1.834449
min 0.000000e+00 1067.000000 10115.000000 1.872000e+03 10.000000 -25.936344 -47.797089
25% 2.596212e+05 13189.000000 13156.000000 4.405700e+04 15.000000 48.529674 10.898697
50% 5.194945e+05 80539.000000 80336.000000 1.007173e+08 24.000000 50.062799 11.575382
75% 7.791805e+05 89231.000000 89231.000000 4.406480e+08 43.000000 52.517037 13.388860
max 1.037491e+06 99096.000000 97909.000000 1.237829e+09 120.000000 60.394306 128.901423
In [11]:
data.shape
Out[11]:
(259374, 21)
In [12]:
sector_sequence = {"sector_name": ["Baumarkt", "Discounter", "Möbelhäuser", "Mode", "Fast Food"]}
color_sequence = ["red", "green", "blue", "goldenrod", "pink"]

Geographical Distribution

Here we want to see the distribution of activities per state and per sector. As we can see, Bayern/southern Germany has the biggest cluster of activities. Zooming in on the maps, Discounters and furniture stores have the highest density.

In [13]:
def germany_count():
    sample_data = data.sample(n=50000, random_state=1)
    fig = px.scatter_mapbox(sample_data, lat="lat", lon="long", hover_name="store_city", color="sector_name",
                            hover_data=["sector_name", "visit_duration_minutes"],
                            color_discrete_sequence=color_sequence, category_orders=sector_sequence, zoom=5, height=500)
    fig.update_layout(mapbox_style="open-street-map")
    fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
    fig.show()
In [14]:
germany_count()

In Berlin, Discounters and furniture stores see a lot of activity. This could be because it is a major urban center that people are constantly moving into. It is not a big place for Fast Food; in fact, the main cluster for that sector is concentrated around the Alexanderplatz area.

In [15]:
def berlin_count():
    berlin = data[data["store_federal_state"] == "Berlin"]
    fig = px.scatter_mapbox(berlin, lat="lat", lon="long", color="sector_name", size="visit_duration_minutes",
                            color_discrete_sequence=color_sequence, category_orders=sector_sequence, size_max=8, zoom=8)
    fig.update_layout(mapbox_style="open-street-map")
    fig.update_layout(margin={"r": 1, "t": 1, "l": 2, "b": 2})
    fig.show()
In [16]:
berlin_count()

Hamburg resembles Berlin in that it is dominated by Discounters and furniture stores. Unlike Berlin, it is less dense and has fewer users, as evidenced by the smaller size of the dots.

In [17]:
def hamburg_count():
    hamburg = data[data["store_federal_state"] == "Hamburg"]
    fig = px.scatter_mapbox(hamburg, lat="lat", lon="long", color="sector_name", size="visit_duration_minutes",
                            color_discrete_sequence=color_sequence, category_orders=sector_sequence, size_max=8, zoom=8)
    fig.update_layout(mapbox_style="open-street-map")
    fig.update_layout(margin={"r": 1, "t": 1, "l": 2, "b": 2})
    fig.show()
In [18]:
hamburg_count()

Bayern is denser than Berlin and Hamburg, with more variety across sectors. This could be because the area covered is bigger and has more urban centers (which are more populous and have higher incomes). A lot of activity is centered on places like Munich, Nuremberg, and Frankfurt. These places also have more furniture-store activity, mirroring Berlin and Hamburg. Outside of these areas, Bayern has more activity in the Fast Food sector, but my feeling is that this reflects the big malls/shopping centers one sees outside city centers. Visit durations outside the city centers are also much lower, and this is more evident in Bayern than in Berlin and Hamburg.

In [44]:
def bayern_count():
    bayern = data[data["store_federal_state"] == "Bayern"]
    fig = px.scatter_mapbox(bayern, lat="lat", lon="long", color="sector_name", size="visit_duration_minutes",
                            color_discrete_sequence=color_sequence, category_orders=sector_sequence, size_max=8, zoom=5)
    fig.update_layout(mapbox_style="open-street-map")
    fig.update_layout(margin={"r": 1, "t": 1, "l": 2, "b": 2})
    fig.show()
In [45]:
bayern_count()

USER DATA OVER TIME

There is a clear pattern in user behavior over time. The data shows regular peaks and dips, which indicate a weekly rhythm: in general, people go shopping later in the week, peaking on Fridays. Looking at the trend, the number of users is stable until the middle of March. This is the start of the lockdown in Germany, when schools and offices began to close. Despite this, the habit of shopping in the later part of the week persists.

In [21]:
def visit_over_time():
    data_user = data.groupby("visit_start_date")["user_id"].count()
    plt.figure(figsize=(20, 8))
    plt.ylabel("volume of visits")
    plt.xlabel("date of visits")
    #use positional .iloc for the 60th day rather than label-based indexing
    plt.annotate('start of lockdown in Germany',
                 (data_user.index[59], data_user.iloc[59]),
                 xytext=(20, 20),
                 textcoords='offset points')
    plt.plot(data_user)
    plt.show()
In [22]:
visit_over_time()

Where do users go? Here is a breakdown of the shopping trips users made during the period. They spent a lot of time buying food. After the lockdown, the number of visits to furniture and clothing stores declined. Activity in the Baumarkt sector was fairly stable, perhaps because such stores are not part of most people's regular shopping trips.

In [23]:
def visit_breakdown():
    data_user = data.groupby("visit_start_date")["user_id"].count()
    fig, ax = plt.subplots(figsize=(20, 10))
    plt.ylabel("volume of visits")
    data.groupby(['visit_start_date', 'sector_name'])['sector_name'].count().unstack().plot(ax=ax, stacked=True, kind="area")
    plt.annotate('start of lockdown in Germany',
                 (data_user.index[59], data_user.iloc[59]),
                 xytext=(20, 20),
                 textcoords='offset points')
In [24]:
visit_breakdown()

Volume of users in stores by day. Food shops and Fast Food have the highest volume across the week.

In [25]:
stores =  data.groupby(["sector_name", "visit_day"])['user_id'].count().unstack()
stores
Out[25]:
visit_day Monday Tuesday Wednesday Thursday Friday Saturday Sunday
sector_name
Baumarkt 2198 1453 1975 2374 2603 2486 600
Discounter 24719 16130 21324 26713 27994 24530 7174
Fast Food 7447 5342 7183 8593 9025 7479 5468
Mode 3338 2133 3099 3940 3997 3422 1052
Möbelhäuser 4055 2726 3597 4580 4550 4513 1562

Bayern has the largest number of activities, followed by Berlin. For the most part, the two mirror each other's peaks and valleys.

In [26]:
def city_breakdown():
    data_user = data.groupby("visit_start_date")["user_id"].count()
    fig, ax = plt.subplots(figsize=(20, 10))
    plt.ylabel("number of visits")
    plt.annotate('start of lockdown in Germany',
                 (data_user.index[59], data_user.iloc[59]),
                 xytext=(20, 20),
                 textcoords='offset points')
    data.groupby(['visit_start_date', 'store_federal_state'])['store_federal_state'].count().unstack().plot(ax=ax, stacked=True, kind="area")
In [27]:
city_breakdown()

When do shoppers go? The peak days are Thursday, Friday, and Saturday in all states. Friday is the busiest, while Sunday is the quietest, owing to the Sunday Ruhetag (rest day) observed by retail shops.

In [28]:
fig, ax = plt.subplots(figsize=(20,8))
plt.ylabel("number of visits")
data.groupby(['visit_day', 'store_federal_state'])['store_federal_state'].count().unstack().plot(ax=ax, stacked=False, kind="bar")
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa6c58897d0>

How long do shoppers stay? Furniture shopping takes longer than other activities, followed by clothes shopping. Shopping for food has the lowest average duration at 32 minutes. Meanwhile, Berliners tend to shop more quickly than people from Bayern or Hamburg.

In [29]:
data.groupby("sector_name")['visit_duration_minutes'].mean().reset_index()
Out[29]:
sector_name visit_duration_minutes
0 Baumarkt 33.387318
1 Discounter 32.331967
2 Fast Food 33.474405
3 Mode 36.528669
4 Möbelhäuser 38.932729
In [30]:
data.groupby("store_federal_state")['visit_duration_minutes'].mean().reset_index()
Out[30]:
store_federal_state visit_duration_minutes
0 Bayern 34.071219
1 Berlin 32.906528
2 Hamburg 33.558558
In [31]:
store_day = data.groupby(["sector_name", "visit_day", "store_federal_state"])['visit_duration_minutes'].mean().reset_index()

Below are the charts that support the findings above. On average, people stay longer in furniture and clothing stores, with peak activity on Friday and Saturday.

In [32]:
fig, ax = plt.subplots(figsize=(20,8))
ax= sns.swarmplot(x="visit_day", y="visit_duration_minutes", hue="sector_name", data=store_day, size=10)
ax.set_title("average duration of visit across days by sector")
Out[32]:
Text(0.5, 1.0, 'average duration of visit across days by sector')
In [33]:
fig, ax = plt.subplots(figsize=(20,5))
ax= sns.swarmplot(x="visit_day", y="visit_duration_minutes", hue="store_federal_state", data=store_day, size=10)
ax.set_title("average duration of visits by day broken down by state")
Out[33]:
Text(0.5, 1.0, 'average duration of visits by day broken down by state')

Looking at population data and store activity

Since the data features location, I thought it would be interesting to see whether there is a relationship between population density and store information such as sector and length of visit. After all, we would expect places like Aldi or Lidl to do brisk business in residential areas. I found a dataset with the population of each Postleitzahl (postal code) in Germany and merged it with the current dataset.

In [38]:
population = pd.read_csv("/home/kla/Documents/plz_einwohner.csv", delimiter=",", header=0, error_bad_lines=False, low_memory=True)
In [39]:
population.columns = ["store_zip", "population"]
In [40]:
data_pop = data.merge(population, on="store_zip", how="left")
In [41]:
data_pop.head(3)
Out[41]:
Unnamed: 0 user_id user_platform_category user_platform_os user_zip store_zip store_city store_id store_federal_state visit_start ... visit_duration_minutes coordinates lat long visit_start_date visit_start_hour visit_end_date visit_end_hour visit_day population
0 133539 8C8CE757-1DFA-41F2-8A92-6BA36EBECEA6 phone.native ios 91301.0 91301 Forchheim 37186 Bayern 2020-01-18 18:56:12.826 ... 12 (49.7186937, 11.0596433) 49.718694 11.059643 2020-01-18 18:56:12.826000 2020-01-18 19:08:12.826000 Saturday 30330
1 464070 f0448e00-55bf-4daa-9f64-480c955dcefa phone.native android 63741.0 63741 Aschaffenburg 742056175 Bayern 2020-03-07 15:45:19.670 ... 69 (49.9740542, 9.1493636) 49.974054 9.149364 2020-03-07 15:45:19.670000 2020-03-07 16:54:19.670000 Saturday 25076
2 451530 4FC2A6B0-B8AE-461C-BC88-4D518773743E phone.native ios 85435.0 85435 Erding 100717560 Bayern 2020-01-30 10:12:24.343 ... 27 (48.3064441, 11.9076579) 48.306444 11.907658 2020-01-30 10:12:24.343000 2020-01-30 10:39:24.343000 Thursday 33535

3 rows × 22 columns
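One thing worth checking after a left merge like this is how many visits actually matched a population entry; pandas' `indicator` flag makes that easy. A sketch with toy stand-in frames (the real CSVs aren't included here; the zip codes and populations below are illustrative):

```python
import pandas as pd

# Toy stand-ins for the real `population` and visits frames
population = pd.DataFrame({"store_zip": [85221, 91301], "population": [47000, 30330]})
visits = pd.DataFrame({"store_zip": [85221, 91301, 99999]})

# indicator=True adds a _merge column: 'both' = matched, 'left_only' =
# no population found for that zip (those rows become NaN in data_pop)
check = visits.merge(population, on="store_zip", how="left", indicator=True)
print(check["_merge"].value_counts())
```

A large `left_only` count would mean the low correlations below are partly an artifact of missing population data rather than a real non-relationship.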

Here we can see that the correlations between population and the other factors are very low. This, I think, is worth investigating. It could be that the location features need refining: using the zip codes might make a difference, but that requires further coordinate retrieval from external sources.

In [42]:
corr = data_pop.corr()
plt.figure(figsize = (10, 8))
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            linewidths=2,
            annot=True,
            cmap='YlGnBu')
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa6c5966c50>
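One cheap refinement before fetching new coordinates: restrict the heatmap to columns where correlation is meaningful. `Unnamed: 0`, `store_zip`, and `store_id` are identifiers, not quantities, so their correlations are noise. A sketch on a toy frame standing in for `data_pop` (values illustrative), using a rank correlation, which also catches monotone non-linear relationships:

```python
import pandas as pd

# Toy stand-in for data_pop with only the substantive numeric columns
# (identifier columns like store_zip and store_id deliberately excluded)
df = pd.DataFrame({
    "population": [47000, 30330, 25076, 33535],
    "visit_duration_minutes": [19, 12, 69, 27],
})
# Spearman correlates the ranks of the values rather than the values themselves
print(df.corr(method="spearman"))
```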
In [ ]: